Learning policies from fixed offline datasets is a key challenge for scaling up reinforcement learning (RL) algorithms towards practical applications. This is largely because off-policy RL algorithms suffer from distributional shift, due to the mismatch between the dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms using stationary distribution corrections. We show that by using Fenchel duality, we can avoid the double-sampling issue in computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer leads to a lower bound on the offline policy optimization objective, which helps avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.
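As a rough illustration of the dual trick described above, the sketch below (not the authors' code; weighted_returns, nu, and lam are assumed placeholders) writes the variance penalty through its Fenchel dual, Var(X) = min_nu E[(X - nu)^2], so the gradient involves only a single expectation:

    import torch

    def variance_regularized_loss(weighted_returns, nu, lam=0.1):
        # weighted_returns: per-sample returns scaled by stationary-distribution
        # correction ratios (assumed to be computed elsewhere); nu: a learnable scalar
        policy_term = -weighted_returns.mean()                   # maximize expected return
        variance_term = ((weighted_returns - nu) ** 2).mean()    # dual form: minimizing over nu recovers Var
        return policy_term + lam * variance_term

Minimizing this loss jointly over the policy parameters and nu recovers the variance penalty at the inner optimum (nu equal to the mean weighted return), without ever forming a product of two expectations whose gradient would require double sampling.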
Developing robots that are capable of many skills and of generalizing to unseen scenarios requires progress on two fronts: efficient collection of large and diverse datasets, and training of high-capacity policies on the collected data. While large datasets have propelled progress in other fields such as computer vision and natural language processing, collecting data of comparable scale is particularly challenging for physical systems like robots. In this work, we propose a framework to bridge this gap and better scale up robot learning, under the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. In the CACTI framework, we highlight the benefit of adapting state-of-the-art models for image generation as part of the augmentation stage, and the significant improvement in training efficiency from using pretrained out-of-domain visual representations at the compression stage. Experimentally, we demonstrate that 1) on a real robot setup, CACTI enables efficient training of a single policy that can perform 10 manipulation tasks involving kitchen objects and is robust to varying layouts of distractor objects; 2) in a simulated kitchen environment, CACTI trains a single policy on 18 semantic tasks across up to 50 layout variations per task. The simulation task benchmark and augmented datasets in both real and simulated environments will be released to facilitate future research.
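A purely illustrative sketch of the four stages as described in the abstract; all function names and the trivial stand-in bodies below are assumptions, not the released CACTI code:

    import numpy as np

    def collect_demos():                          # stage 1: multi-task demonstration collection
        return np.random.rand(8, 64, 64, 3)       # placeholder RGB frames

    def augment(frames):                          # stage 2: generative image augmentation
        return np.concatenate([frames, frames[:, :, ::-1]], axis=0)   # stand-in for generated variations

    def encode(frames):                           # stage 3: frozen out-of-domain visual encoder (compression)
        return frames.reshape(len(frames), -1)[:, :128]               # stand-in features

    def train_policy(features):                   # stage 4: multi-task imitation learning
        return features.mean(axis=0)              # stand-in "policy" parameters

    policy = train_policy(encode(augment(collect_demos())))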
While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work addresses this challenge by learning low-dimensional representations of observations through auxiliary objectives such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective that jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL that concern policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample efficiency of the best model-based and model-free RL methods. While such sample-efficient methods are typically computationally demanding, our method requires roughly 50% less wall-clock time.
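A minimal, hedged sketch of what such a single joint objective could look like, using a squared-error self-consistency term as a stand-in (the paper's actual lower bound may combine different terms; all names below are assumptions):

    import torch

    def joint_objective(pred_return, z_next_pred, z_next_enc, beta=1.0):
        # pred_return: return predicted from a latent-space rollout of the policy
        # z_next_pred: next latent predicted by the learned dynamics model
        # z_next_enc:  encoding of the actually observed next state
        consistency = ((z_next_pred - z_next_enc) ** 2).sum(-1).mean()   # self-consistency of model and encoder
        return -pred_return.mean() + beta * consistency                  # minimized jointly over encoder, model, policy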
Auditing trained deep learning (DL) models prior to deployment is critical for preventing unintended consequences. One of the biggest challenges in auditing is the lack of human-interpretable specifications for DL models that are directly useful to auditors. We address this challenge through a sequence of semantically aligned unit tests, where each unit test verifies whether a predefined specification (e.g., accuracy above 95%) is satisfied with respect to controlled and semantically aligned variations in the input space (e.g., in face recognition, the angle relative to the camera). We enable such unit tests through variations in a semantically interpretable latent space of a generative model. In addition, we conduct certified training of the DL model through a latent space representation shared with the generative model. Evaluating on four different datasets, covering images of chest X-rays, human faces, ImageNet classes, and towers, we show how AuditAI allows us to obtain controlled variations for certified training. Our framework, AuditAI, thus bridges the gap between semantically aligned formal verification and scalability. A blog post accompanying the paper is available at https://developer.nvidia.com/blog/nvidia-research-auditing-ai-models-for-verified-deployment-under-semantic-specifications
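A hedged sketch of one such unit test, where generator, classifier, and the latent direction are assumed stand-ins for the paper's components rather than its actual API:

    import torch

    def unit_test(generator, classifier, z, label, direction, threshold=0.95):
        scales = torch.linspace(-1.0, 1.0, steps=20)             # controlled range of the semantic factor
        images = generator(z + scales[:, None] * direction)      # e.g., a sweep of camera angle for a face
        preds = classifier(images).argmax(dim=-1)
        accuracy = (preds == label).float().mean().item()
        return accuracy >= threshold                             # pass/fail against the specification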
Learning from visual data opens up the possibility of acquiring a wide range of manipulation behaviors by leveraging human demonstrations, without specifying each behavior mathematically but instead through natural task specification. In this paper, we present Learning by Watching (LbW), an algorithmic framework for policy learning by imitating a single video of a task. The key insights of our method are twofold. First, since the human arm may not have the same morphology as the robot arm, our framework learns unsupervised human-to-robot translation to overcome the morphology mismatch problem. Second, to capture the details in salient regions that are crucial for learning state representations, our model performs unsupervised keypoint detection on the translated robot videos. The detected keypoints form a structured representation that contains semantically meaningful information and can be used directly for reward computation and policy learning. We evaluate the effectiveness of our LbW framework on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing. Extensive experimental evaluations demonstrate that our method performs favorably against state-of-the-art approaches.
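A hedged sketch of a keypoint-matching reward in this spirit, where translate_to_robot and detect_keypoints stand in for the learned translation and keypoint modules, which are not specified here:

    import numpy as np

    def keypoint_reward(robot_frame, human_frame, translate_to_robot, detect_keypoints):
        demo_kp = detect_keypoints(translate_to_robot(human_frame))   # keypoints of the translated demo frame
        obs_kp = detect_keypoints(robot_frame)                        # keypoints of the current robot observation
        return -np.linalg.norm(obs_kp - demo_kp)                      # closer keypoint configuration => higher reward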